




Learning to Mutate with Hypergradient Guided Population

Neural Information Processing Systems

Computing the gradient of model hyperparameters, i.e., hypergradient, enables a promising and natural way to solve the hyperparameter optimization task. However, gradient-based methods could lead to suboptimal solutions due to the non-convex nature of optimization in a complex hyperparameter space. In this study, we propose a hyperparameter mutation (HPM) algorithm to explicitly consider a learnable trade-off between using global and local search, where we adopt a population of student models to simultaneously explore the hyperparameter space guided by hypergradient and leverage a teacher model to mutate the underperforming students by exploiting the top ones. The teacher model is implemented with an attention mechanism and is used to learn a mutation schedule for different hyperparameters on the fly. Empirical evidence on synthetic functions is provided to show that HPM outperforms hypergradient significantly. Experiments on two benchmark datasets are also conducted to validate the effectiveness of the proposed HPM algorithm for training deep neural networks compared with several strong baselines.
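The abstract describes an explore-exploit loop that is easy to illustrate in miniature. The toy below is not the authors' HPM implementation: the one-dimensional hyperparameter objective, the finite-difference hypergradient, and the fixed Gaussian mutation (standing in for the learned attention-based teacher schedule) are all illustrative assumptions.

```python
import math
import random

def loss(h):
    # Toy non-convex "validation loss" over a single hyperparameter h.
    return math.sin(3.0 * h) + 0.1 * (h - 2.0) ** 2

def hypergradient(h, eps=1e-5):
    # Finite-difference stand-in for an analytically computed hypergradient.
    return (loss(h + eps) - loss(h - eps)) / (2.0 * eps)

def hpm(pop_size=8, steps=200, lr=0.05, mutate_every=20, sigma=0.3, seed=0):
    rng = random.Random(seed)
    pop = [rng.uniform(-3.0, 6.0) for _ in range(pop_size)]
    for t in range(1, steps + 1):
        # Local search: every student follows its hypergradient.
        pop = [h - lr * hypergradient(h) for h in pop]
        if t % mutate_every == 0:
            # Global search: replace the worst half with noisy copies of the
            # best half (a fixed Gaussian mutation here, standing in for the
            # paper's learned teacher schedule).
            pop.sort(key=loss)
            half = pop_size // 2
            pop[half:] = [pop[i] + rng.gauss(0.0, sigma) for i in range(half)]
    return min(pop, key=loss)

best = hpm()
```

The mutation step is what lets students escape the poor local minima that pure hypergradient descent gets stuck in on this non-convex objective.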


Explaining Hyperparameter Optimization via Partial Dependence Plots

Neural Information Processing Systems

Most machine learning (ML) algorithms are highly configurable. Their hyperparameters must be chosen carefully, as their choice often impacts the model performance. Even for experts, it can be challenging to find well-performing hyperparameter configurations.
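The core computation behind a one-dimensional partial dependence plot is simple to sketch: fix one hyperparameter at each grid value, average a surrogate's predicted performance over the observed configurations. The surrogate below (a hypothetical model quadratic in log learning rate) and the configuration set are invented for illustration; only the averaging procedure is the PDP idea itself.

```python
import math

def partial_dependence(predict, configs, param_index, grid):
    """1-D partial dependence of `predict` on one hyperparameter.

    For each grid value v, set that hyperparameter to v in every observed
    configuration and average the surrogate's predicted performance.
    """
    pd_values = []
    for v in grid:
        preds = []
        for cfg in configs:
            cfg_v = list(cfg)
            cfg_v[param_index] = v
            preds.append(predict(cfg_v))
        pd_values.append(sum(preds) / len(preds))
    return pd_values

# Hypothetical surrogate: validation error depends on learning rate (index 0)
# quadratically in log-space, and weakly on batch size (index 1).
def surrogate(cfg):
    lr, bs = cfg
    return (math.log10(lr) + 2.0) ** 2 + 0.001 * bs

configs = [(1e-3, 32), (1e-2, 64), (1e-1, 128)]
grid = [1e-4, 1e-3, 1e-2, 1e-1]
pd = partial_dependence(surrogate, configs, 0, grid)
```

For this surrogate the PDP bottoms out near a learning rate of 1e-2, making the marginal effect of that hyperparameter visible despite the batch-size variation.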


Efficient Hyperparameter Tuning via Trajectory Invariance Principle

Li, Bingrui, Wen, Jiaxin, Zhou, Zhanpeng, Zhu, Jun, Chen, Jianfei

arXiv.org Artificial Intelligence

As hyperparameter tuning becomes increasingly costly at scale, efficient tuning methods are essential. Yet principles for guiding hyperparameter tuning remain limited. In this work, we seek to establish such principles by considering a broad range of hyperparameters, including batch size, learning rate, and weight decay. We identify a phenomenon we call trajectory invariance, where pre-training loss curves, gradient noise, and gradient norm exhibit invariance (closely overlapping) with respect to a quantity that combines learning rate and weight decay. This phenomenon effectively reduces the original two-dimensional hyperparameter space to one dimension, yielding an efficient tuning rule: follow the salient direction revealed by trajectory invariance. Furthermore, we refine previous scaling laws and challenge several existing viewpoints. Overall, our work proposes new principles for efficient tuning and inspires future research on scaling laws.
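As a rough illustration of how such an invariance collapses the search: if the loss depended on learning rate and weight decay only through some combined quantity, tuning reduces to a one-dimensional scan along that quantity. The product lr × wd used below is an assumption made for the sketch, not necessarily the paper's exact invariant, and the loss function is a toy stand-in.

```python
import math

def final_loss(lr, wd):
    # Hypothetical stand-in for a pre-training loss that depends on (lr, wd)
    # only through their product, mimicking trajectory invariance.
    k = lr * wd
    return (math.log10(k) + 4.0) ** 2 + 0.1

def tune_1d(k_grid, lr=1e-3):
    # Scan only the invariant quantity k = lr * wd instead of the full
    # two-dimensional (lr, wd) grid: for a fixed lr, recover wd = k / lr.
    best_k = min(k_grid, key=lambda k: final_loss(lr, k / lr))
    return best_k, final_loss(lr, best_k / lr)

k_grid = [1e-6, 1e-5, 1e-4, 1e-3, 1e-2]
best_k, best = tune_1d(k_grid)
```

A full grid over lr and wd would cost the square of this: five evaluations here versus twenty-five for a 5 × 5 grid, with the gap growing as resolution increases.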


Supplementary material A for numerical features

Neural Information Processing Systems

We provide a visual explanation of how embeddings are passed to the MLP in Figure 2 and Figure 3. We also provide a visualisation of target-aware PLE (subsubsection 3.2.2) in Figure 4 (obtaining bins for PLE from decision trees). We used the following datasets: Gesture Phase Prediction (Madeo et al. [27]) and Churn Modeling. We follow the pointwise approach to learning-to-rank and treat this ranking problem as a regression problem. In this section, we apply the quantile-based piecewise linear encoding (described in subsubsection 3.2.1) to MLP and Transformer on the synthetic GBDT-friendly dataset described in section 5.1. The results are visualized in Figure 5. In this section, we test Fourier features implemented exactly as in Tancik et al. We mostly follow Gorishniy et al. [13] in terms of the tuning, training, and evaluation protocols.
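Quantile-based piecewise linear encoding (PLE), referenced above, can be sketched in a few lines: each bin contributes 0 left of its edges, 1 right of them, and a linear interpolation inside. The bin-edge construction via `statistics.quantiles` and the toy data are illustrative, not the paper's exact implementation.

```python
from statistics import quantiles

def ple_encode(x, edges):
    """Piecewise linear encoding of a scalar x given sorted bin edges.

    For a bin [lo, hi): the component is 0.0 if x falls left of the bin,
    1.0 if x falls right of it, and linearly interpolated inside it.
    """
    enc = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        if x <= lo:
            enc.append(0.0)
        elif x >= hi:
            enc.append(1.0)
        else:
            enc.append((x - lo) / (hi - lo))
    return enc

train = list(range(101))  # toy training values for one numerical feature
# Interior edges at empirical quartiles, outer edges at the data range.
edges = [min(train)] + quantiles(train, n=4) + [max(train)]
vec = ple_encode(30.0, edges)
```

Unlike one-hot binning, the interpolated component preserves ordering and magnitude within a bin, which is what makes the encoding friendly to MLP and Transformer backbones.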



Supplementary material

Neural Information Processing Systems

All the experiments were conducted under the same conditions in terms of software versions. The feature preprocessing for DL models is described in the main text and is applied to the original features. The remaining notation follows that of the main text. For most experiments, training times can be found in the source code.


Comparison of Optimised Geometric Deep Learning Architectures, over Varying Toxicological Assay Data Environments

Kalian, Alexander D., Otte, Lennart, Lee, Jaewook, Benfenati, Emilio, Dorne, Jean-Lou C. M., Potter, Claire, Osborne, Olivia J., Guo, Miao, Hogstrand, Christer

arXiv.org Artificial Intelligence

Geometric deep learning is an emerging technique in Artificial Intelligence (AI) driven cheminformatics; however, the unique implications of different Graph Neural Network (GNN) architectures are poorly explored for this space. This study compared the performances of Graph Convolutional Networks (GCNs), Graph Attention Networks (GATs) and Graph Isomorphism Networks (GINs), applied to 7 different toxicological assay datasets of varying data abundance and endpoint, to perform binary classification of assay activation. Following pre-processing of molecular graphs, enforcement of class balance and stratification of all datasets across 5 folds, Bayesian optimisations were carried out for each GNN applied to each assay dataset (resulting in 21 unique Bayesian optimisations). Optimised GNNs performed at Area Under the Curve (AUC) scores ranging from 0.728 to 0.849 (averaged across all folds), naturally varying between specific assays and GNNs. GINs were found to consistently outperform GCNs and GATs for the top 5 of 7 most data-abundant toxicological assays. GATs however significantly outperformed over the remaining 2 most data-scarce assays. This indicates that GINs are a more optimal architecture for data-abundant environments, whereas GATs are a more optimal architecture for data-scarce environments. Subsequent analysis of the explored higher-dimensional hyperparameter spaces, as well as optimised hyperparameter states, found that GCNs and GATs reached measurably closer optimised states with each other, compared to GINs, further indicating the unique nature of GINs as a GNN algorithm.
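The class-balance and stratification step described above can be sketched generically; this is not the authors' pipeline, and the labels and fold count below are toy choices. Each class's indices are dealt round-robin into folds so every fold preserves the overall class balance.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_folds=5, seed=0):
    # Assign sample indices to folds so each fold preserves class balance.
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(n_folds)]
    for idxs in by_class.values():
        rng.shuffle(idxs)  # randomise within each class before dealing
        for i, idx in enumerate(idxs):
            folds[i % n_folds].append(idx)
    return folds

labels = [0] * 50 + [1] * 50  # toy class-balanced binary assay labels
folds = stratified_folds(labels)
```

With stratified folds in hand, each of the 21 Bayesian optimisations in the study can score a candidate hyperparameter configuration by its mean validation AUC across folds.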